Predictive Checks + Final Project

Tuesday, March 12

Today we will…

  • Data Intro + Cleaning: Feedback
  • New Material:
    • Predictive Checks
  • Final Project Work Time

Data Intro + Cleaning: Feedback

Your grade reflects the completeness of your submission, not the correctness!

  • Describe how the data were collected, if provided.
    • E.g., how do we know the average life expectancy?
  • Explain why you performed your chosen cleaning steps.
  • This assignment is to assess your learning in this course. You are expected to use functions and code techniques from this class.
    • You will lose points for using non-tidyverse functions to do tasks that we have discussed in this class!

Data Intro + Cleaning: Feedback

Code:

  • All code should be hidden or folded – use echo: false or code-fold: true.
  • Don’t name the R functions you used (e.g., “We used str_detect to…”).
    • Instead, describe what you did in plain English.
  • Don’t use dataset or variable names in the text.
    • Say “We removed missing values from per capita GDP.” rather than “We removed NA from per_cap_gdp.”
  • Don’t print out the head of the data!

Data Intro + Cleaning: Feedback

Citations:

  • Cite your sources, including:
    • data sources.
    • description of your variables that is not general knowledge.
  • You should have both in-line citations and a References section at the end of your report.
  • You may find the visual editor > insert > @citations useful.

Data Intro + Cleaning: Feedback

Style + Organization:

  • Define all acronyms, especially any that are related to the variables of interest.
  • Everything should be in paragraph form – no bullets or numbered lists.
  • Read through your paper from top to bottom to make sure the organization makes sense.
    • At what point might someone get confused?

Predictive Checks

Any good analysis should include a check of the “adequacy of the fit of the model to the data and the plausibility of the model…” – Andrew Gelman

Predictive Checks

Predictive checks allow us to assess if our fitted model would produce data similar to the data that we observed.

  • Yes? Our model is a good fit.
  • No? Our model is not a good fit.

This is an assessment of model fit.

Caution

Predictive checks are not meant to predict the response variable for new observations of the explanatory variable.

Recall: Linear Regression

For simple linear regression, we assume the responses can be modeled as a linear function of the explanatory variable and some error.

\[y = \beta_0 + \beta_1 x_1 + \varepsilon\]

We also assume that those errors \((\varepsilon)\) follow a normal distribution with mean 0 and standard deviation \(\sigma\).

\[\varepsilon \sim N(0, \sigma)\]

Recall: Linear Regression

Therefore, the data we would expect to come from this model can be generated by:

  1. predicting values from a fitted model (\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1\)) …

and

  2. … adding normally distributed errors.
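
Written in the notation of the previous slide, the two steps combine into a single expression for a simulated response. The superscript "sim" and the symbol \(e\) are labels introduced here for illustration; \(\hat{\sigma}\) denotes the estimate of \(\sigma\) from the fitted model.

\[y^{\text{sim}} = \hat{y} + e = \hat{\beta}_0 + \hat{\beta}_1 x_1 + e, \qquad e \sim N(0, \hat{\sigma})\]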

Recall: Linear Regression

This method produces data that perfectly agree with the linear model conditions:

Linear relationship between \(x\) and \(y\).

Independence of observations.

Normality of residuals.

Equal variance of residuals.

Predictive Checks

If we compare data generated from the linear model to the observed data, we can assess how well the linear model fits the observed data.

  • Is it plausible that the observed data could be generated by the model?

The Process

To perform a predictive check…

  1. Fit a regression model to the observed data.

  2. For a set of explanatory values, obtain predicted response values from the model.

  3. Add random errors to the predictions.

  4. Compare the simulated data to the observed data.

  5. Iterate!

The Process

To perform a predictive check…

  1. Fit a regression model to the observed data.

Use the lm() function…
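
A minimal sketch of this step, assuming the built-in cars dataset as a stand-in for your project data (speed as the explanatory variable, dist as the response). The object name obs_model is only illustrative.

```r
# Stand-in data: `cars` ships with R; swap in your own cleaned dataset.
# Fit a simple linear regression of the response (dist) on the explanatory variable (speed).
obs_model <- lm(dist ~ speed, data = cars)

# Estimated intercept and slope
coef(obs_model)
```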

The Process

To perform a predictive check…

  1. Fit a regression model to the observed data.

  2. For a set of explanatory values, obtain predicted response values from the model.

Use the predict() function…
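
One way this might look, again assuming the built-in cars dataset and illustrative object names. Here the "set of explanatory values" is simply the observed values used to fit the model.

```r
# Stand-in data and model (see step 1)
obs_model <- lm(dist ~ speed, data = cars)

# With no `newdata` argument, predict() returns the fitted value
# for each observation used to fit the model.
obs_preds <- predict(obs_model)

head(obs_preds)
```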

The Process

To perform a predictive check…

  1. Fit a regression model to the observed data.

  2. For a set of explanatory values, obtain predicted response values from the model.

  3. Add random errors to the predictions.

Use the rnorm() function…

The random errors have mean 0 and standard deviation estimated by the residual standard error (use sigma()).
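
A sketch of this step under the same assumptions (built-in cars dataset, illustrative object names). The seed is arbitrary and only makes the example reproducible.

```r
set.seed(365)  # arbitrary seed, for reproducibility of this sketch only

# Stand-in data and model (see steps 1 and 2)
obs_model <- lm(dist ~ speed, data = cars)
obs_preds <- predict(obs_model)

# Residual standard error: the estimate of sigma from the fitted model
est_sigma <- sigma(obs_model)

# Draw one normal error per prediction and add it on
sim_errors   <- rnorm(n = length(obs_preds), mean = 0, sd = est_sigma)
sim_response <- obs_preds + sim_errors

head(sim_response)
```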

The Process

To perform a predictive check…

  1. Fit a regression model to the observed data.

  2. For a set of explanatory values, obtain predicted response values from the model.

  3. Add random errors to the predictions.

  4. Compare the simulated data to the observed data.

Use the lm() function to regress observed on simulated…

To measure similarity, record \(R^2\) (here, the proportion of variability in the observed responses explained by a linear relationship with the simulated responses).
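
A sketch of the comparison for a single simulated dataset, assuming the built-in cars dataset; compare_df, compare_model, and sim_r2 are illustrative names.

```r
set.seed(365)  # arbitrary seed, for reproducibility of this sketch only

# Steps 1-3: fit, predict, add normal errors
obs_model    <- lm(dist ~ speed, data = cars)
sim_response <- predict(obs_model) +
  rnorm(nrow(cars), mean = 0, sd = sigma(obs_model))

# Step 4: regress the observed responses on the simulated responses
compare_df    <- data.frame(observed = cars$dist, simulated = sim_response)
compare_model <- lm(observed ~ simulated, data = compare_df)

# Record R^2 as the similarity measure
sim_r2 <- summary(compare_model)$r.squared
sim_r2
```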

The Process

To perform a predictive check…

  1. Fit a regression model to the observed data.

  2. For a set of explanatory values, obtain predicted response values from the model.

  3. Add random errors to the predictions.

  4. Compare the simulated data to the observed data.

  5. Iterate!

Use the map() function to repeat the process over and over…

We want to see how the model performs across many simulated datasets.

  • Compute the \(R^2\) for each.

Instead of \(R^2\), you could use correlation \((r)\), sum of squared errors \((SSE)\), or the estimate of \(\sigma\) \((RMSE)\) to measure similarity.
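
The whole loop could be sketched as follows, assuming the built-in cars dataset and the purrr package. The helper simulate_r2() and the choice of 1000 simulations are hypothetical, made up for this example.

```r
library(purrr)

set.seed(365)  # arbitrary seed, for reproducibility of this sketch only

obs_model <- lm(dist ~ speed, data = cars)

# Hypothetical helper: simulate one dataset from the fitted model and
# return the R^2 from regressing the observed responses on the simulated ones.
simulate_r2 <- function(model, observed) {
  simulated <- predict(model) +
    rnorm(length(observed), mean = 0, sd = sigma(model))
  summary(lm(observed ~ simulated))$r.squared
}

n_sims <- 1000  # arbitrary number of simulated datasets

# map_dbl() returns a numeric vector with one R^2 per simulated dataset
sim_r2 <- map_dbl(1:n_sims, function(i) simulate_r2(obs_model, cars$dist))

summary(sim_r2)
```

Using map_dbl() rather than map() returns a plain numeric vector, which is convenient for plotting.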

Distribution of Simulated \(R^2\)

Plot the distribution of simulated \(R^2\) values to see how well the model performs.

  • Values distributed near 1 indicate a good fit!
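
A possible way to draw this plot, assuming the built-in cars dataset plus the purrr and ggplot2 packages. The number of simulations, the bin count, and all object names are illustrative choices.

```r
library(purrr)
library(ggplot2)

set.seed(365)  # arbitrary seed, for reproducibility of this sketch only

obs_model <- lm(dist ~ speed, data = cars)
observed  <- cars$dist
n_sims    <- 500  # arbitrary number of simulated datasets

# One R^2 per simulated dataset (observed regressed on simulated)
sim_r2 <- map_dbl(1:n_sims, function(i) {
  simulated <- predict(obs_model) +
    rnorm(length(observed), mean = 0, sd = sigma(obs_model))
  summary(lm(observed ~ simulated))$r.squared
})

# Histogram of the simulated R^2 values
r2_plot <- ggplot(data.frame(r2 = sim_r2), aes(x = r2)) +
  geom_histogram(bins = 30) +
  labs(x = "Simulated R-squared",
       y = "Number of simulated datasets")

r2_plot
```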

For your project…

For your group project, you will run predictive checks to assess how well your model performs.

  • This is Section 3 of the Project Details page.

To do…

  • Game Plan Survey

  • Course Evaluation

    • Closes Friday, 3/15 at 11:59pm
  • Final Project Report

    • Due Friday, 3/15 at 11:59pm
  • Final Exam – Thursday 3/21

    • Section 70 (9-11am): 10:10am - 1:00pm
    • Section 71 (12-2pm): 1:10pm - 4:00pm

Thursday, March 14

Today we will…

  • Linear Regression: Feedback
  • Final Exam: What to Expect
  • Remaining Q & A
  • R Hex Cookies!
  • Final Project Work Time

Linear Regression: Feedback

  • Think about the readability of the numbers you are presenting.

    • Do you need 6 decimal places?
    • Is scientific notation easily understood by the public?
  • Include units on your plots!

  • If you do any transformations, make sure you mention them.

    • Also make sure they are clear on any plots!

Linear Regression: Feedback

  • When you present a plot or a table, discuss in words what you want the reader to take away from it.
    • Discuss the table of variances as part of your discussion of model fit.
  • If you are modeling the average across years (or one particular year), make sure you include a plot of the average (or that year) in addition to the full data.

Linear Regression: Feedback

  • Some of you used a ratio of the response variable to the explanatory variable to show the relationship over time.
    • This is often not easy to understand or interpret.
    • I encourage you to find a clearer way to display this relationship over time.
    • If you choose to go this route, you will need lots of clear explanation about what ratio you are calculating and what it means.

Final Exam: What to Expect

The exam is cumulative and will definitely contain questions on:

  • Data manipulations with dplyr and tidyr.
  • Data visualizations with ggplot.
  • Function writing.
  • Functional programming with map.
  • Statistical modeling with lm.

Finals Week: Office Hours

Warning

Wednesday (3/20) – 12:10pm - 2pm; 4:10pm - 5pm

Q & A

Cookies!

To do…

  • Game Plan Survey

  • Course Evaluation

    • Closes Friday, 3/15 at 11:59pm
  • Final Project Report

    • Due Sunday, 3/17 at 11:59pm
  • Final Exam – Thursday 3/21

    • Office Hours: Wed. – 12:10pm - 2pm; 4:10pm - 5pm
    • Section 70 (9-11am): 10:10am - 1:00pm
    • Section 71 (12-2pm): 1:10pm - 4:00pm